Abstract



We propose a novel approach, Sequential Lasso, for feature selection in linear
regression models with ultra-high dimensional feature spaces. We investigate in
this article the asymptotic properties of Sequential Lasso and establish its selection
consistency. Like other sequential methods, the implementation of Sequential Lasso
is not limited by the dimensionality of the feature space. It has advantages over
other sequential methods. Simulation studies comparing Sequential Lasso with
other sequential methods are reported.

Key Words: extended BIC; feature selection; selection consistency; Sequential Lasso;
ultra-high dimensionality.



1 Introduction 

The so-called small-n-large-p problems are abundant in many important contemporary 
scientific fields. A small-n-large-p problem refers to the situation where the number 
of covariates is huge, though only a few of them are causally related to the response 
variable under study, but the sample size is relatively small. There are two related 
but different purposes in the study of small-n-large-p problems: (i) to build a model 
with good prediction properties and (ii) to identify the covariates which are causally 
related to the response variable. The classical least squares regression approach is no
longer appropriate in small-n-large-p problems because of the sheer number of
covariates. Feature selection becomes crucial (by a feature we mean a covariate or a
function of covariates such as the product of any two covariates, etc.). 

Since the seminal paper on LASSO [12] was published in 1996, great interest has been
focused on penalized likelihood methods. These methods include SCAD [5], Elastic
net [18], Adaptive LASSO [H], and Bridge [8], to name but a few. A penalized likelihood
method selects variables and estimates the coefficients at the same time. Under
certain conditions, the various penalized likelihood methods have the so-called oracle
property; that is, asymptotically, the set of causal variables can be identified exactly
and the estimated coefficients are L2 consistent, see [5] [9] [10] [17] [H]. However, the
conditions for the various penalized likelihood methods to achieve the oracle property 
are usually not met when the dimension of the feature space has a polynomial order 
or an exponential order of the sample size. The computation also poses a challenge in 
this case. 

Feature selection procedures of a stepwise nature are computationally appealing. 
Efron et al [4] proposed a sequential procedure called least angle regression (LAR). 
With slight modification, the algorithm of LAR can also compute the solution path of 
LASSO sequentially, which made LASSO more popular. The classical forward stepwise 
regression (FSR) has recently been re-examined in [15] for its properties in feature selection
with ultra-high dimensional feature space. A different version of forward stepwise
regression, referred to as forward selection in [H], has been reconsidered recently and
dubbed orthogonal matching pursuit (OMP), see [1] [13] [2]. The difference between
FSR and OMP is that the former selects at each step the covariate that reduces
the residual sum of squares the most while the latter selects the covariate that has the 
largest absolute correlation with the current residuals. 

In many practical problems, the identification of the set of causal features is of 
primary interest. For example, in genetic quantitative trait loci (QTL) mapping and 
disease gene mapping, of interest are the markers which are either QTL or disease gene 
themselves or are in linkage disequilibrium with QTL or disease genes. More relevant 
properties required of a feature selection method for this purpose are sure screening 
[6] and selection consistency. The sure screening property means that the selected 
set should contain the set of causal features with probability converging to 1. The 
selection consistency means that the selected set should be the same as the exact set of 
causal features with probability converging to 1. Under the well-known irrepresentable
condition, the LASSO has been shown to possess the property of selection consistency
when the penalty parameter is properly chosen [17]. If the covariance matrix of the
vector of the covariates has eigenvalues bounded both from above and away from zero in 
addition to some other assumptions, it is established in [15] that the FSR has the sure 
screening property when the procedure is carried out at a certain step before the number 
of steps reaches the sample size. The OMP has been studied under conditions called 
Exact Recovery Condition (ERC) [2] [13] and Mutual Incoherence Property (MIP) [1]. 
The ERC is similar to the irrepresentable condition but much stronger. The MIP is 
the condition that $\rho_{\max} < 1/(2k-1)$, where $\rho_{\max}$ is the largest absolute correlation among
all pairs of covariates and $k$ is the number of causal covariates. The ERC implies MIP,
see [1] [13]. Both the sure screening property and the selection consistency of OMP 
have been examined in [1] under MIP together with other conditions. 

We propose an alternative sequential feature selection procedure which we call
sequential LASSO. The procedure starts with the original LASSO and the penalty 
parameter is tuned to the largest which allows some coefficients to be estimated nonzero. 
The features with nonzero estimated coefficients form the current active set. In the 
following steps, a partially penalized sum of squares is considered. The coefficients 
of the features in the current active set are not penalized but the coefficients of all 
the other features are. Then the penalty parameter is again tuned to the largest 
which allows some features outside the current active set to have estimated nonzero 
coefficients. The current active set is then updated by adding these new features 
with nonzero estimated coefficients. The procedure continues until it meets a certain 
stopping rule. We investigate the properties of the sequential LASSO in this article. 
We establish its selection consistency in the situation that the dimension of the feature 
space is of an exponential order of the sample size and the number of causal features 
is allowed to diverge under conditions weaker than the conditions mentioned in the 
last paragraph. We provide some general special cases where the conditions required 
for the sequential LASSO to be selection consistent hold but the conditions mentioned 
in the last paragraph fail. The sequential LASSO bears some similarity with OMP. 
At steps where a partial positive cone condition is satisfied, the sequential LASSO 
selects new features with the same criterion as OMP. The properties established for 
the sequential LASSO then apply to OMP. Thus, we reveal some new properties of
OMP other than those discovered in [1] [13]. The stopping rule is given by
the extended BIC (EBIC) proposed in [3]. The selection consistency of EBIC in the 
same situation is recently established under similar conditions in [11]. Thus, coupled 
with EBIC the sequential LASSO provides a practically applicable selection consistent 
method for feature selection in small-n-large-p problems. Simulation studies are carried 
out to compare the sequential LASSO with other stepwise methods such as FSR and 
the original LASSO. 

The remainder of the article is arranged as follows. The detailed procedure of the
sequential LASSO and its basic properties are given in §2. The selection consistency of
the sequential LASSO is studied in §3. Simulation studies are reported in a later section.



2 Procedure of Sequential LASSO and its basic properties 

Consider the linear regression model below:

$$y_i = \beta_0 + \sum_{j=1}^{p_n} \beta_{nj} x_{ij} + \epsilon_i, \quad i = 1, \ldots, n, \tag{2.1}$$

where the $\epsilon_i$'s are i.i.d. normal variables with mean zero and variance $\sigma^2$, and the $x_{ij}$'s are called
features, which are either deterministically determined or observed at random. The
following particular natures are assumed for the above model. (a) The dimensionality
of the feature space is assumed as $\ln p_n = O(n^{\kappa})$ for $\kappa > 0$ (ultra-high). (b) Let
$s_{0n} = \{j : \beta_{nj} \neq 0\}$ and let $|s_{0n}|$ denote the cardinality of $s_{0n}$. It is assumed that
$|s_{0n}| = O(n^{c})$ for some $0 < c < 1$. (c) The magnitude of $\beta_{nj}$, $j \in s_{0n}$, is allowed to vary
with $n$. In matrix notation, (2.1) is expressed in terms of
$\beta_n = (\beta_{n1}, \ldots, \beta_{np_n})^\top$, $y_n = (y_1, \ldots, y_n)^\top$,
$X_n = (x_{ij})_{i=1,\ldots,n;\; j=1,\ldots,p_n}$ and $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)^\top$.
Let the columns of $X_n$ be normalized such that $\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ and
$\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$ for all $j$. For the sake of simplicity, the subscript $n$ in the notation will
be dropped afterwards. Let $S$ denote the set of indices $\{1, 2, \ldots, p_n\}$. The sequential
LASSO is described as follows.

• The procedure starts with the $L_1$ penalized sum of squares:

$$l_1 = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where $\|\cdot\|_2$ is the $L_2$ norm of a vector. $l_1$ is minimized with $\lambda$ tuned to the
largest value that allows some $\beta_j$ to be nonzero in the minimizer. The set of
indices of the nonzero $\beta_j$'s is denoted by $s_{*1}$ and referred to as the active set.

• In the second step, $l_1$ is replaced by

$$l_2 = \|y - X\beta\|_2^2 + \lambda \sum_{j \in s_{*1}^c} |\beta_j|,$$

where $s_{*1}^c$ denotes the complement of $s_{*1}$ in $S$. Then $l_2$ is minimized with
$\lambda$ tuned to the largest value that allows some $\beta_j$ with $j \in s_{*1}^c$ to be nonzero in
the minimizer. The active set is updated as the set of all features with nonzero
coefficient in this minimization and denoted by $s_{*2}$.

• In general, after $k$ steps have been carried out and the active set $s_{*k}$ is obtained,
the penalized sum of squares

$$l_{k+1} = \|y - X\beta\|_2^2 + \lambda \sum_{j \in s_{*k}^c} |\beta_j|$$

is then minimized with $\lambda$ tuned to the largest value that allows some
$\beta_j$ with $j \in s_{*k}^c$ to be nonzero in the minimizer, and the active set is updated as $s_{*k+1}$.



• The process continues until some stopping criterion is met. 

The sequential LASSO described above selects features sequentially by different partially
$L_1$ penalized sums of squares. Once a feature is selected at a certain step, its
coefficient will no longer be penalized in the subsequent steps, which ensures that the
feature will always remain in the model, see the basic properties below. This differs
from the ordinary LASSO where a feature included in an earlier stage could be left out
in a later stage in the solution path.
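
The stepwise procedure above can be sketched in code under a simplifying assumption: at each step the feature that enters is taken to be a single maximizer of $|X_j^\top[I - H(s)]y|$ over the features outside the current active set $s$. Ties, the partial positive cone condition and the EBIC stopping rule are all ignored, and the function name `sequential_selection` and the fixed step count `n_steps` are hypothetical choices made for this illustration only.

```python
import numpy as np

def sequential_selection(X, y, n_steps):
    """Greedy sketch: at each step add one feature maximizing |X_j^T (I - H(s)) y|."""
    active = []
    for _ in range(n_steps):
        if active:
            Xs = X[:, active]
            # Residual of y after least-squares projection onto the active columns,
            # i.e. (I - H(s)) y.
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ coef
        else:
            resid = y.copy()
        score = np.abs(X.T @ resid)
        score[active] = -np.inf  # an active feature is never removed or re-selected
        active.append(int(np.argmax(score)))
    return active

# Toy design with orthonormal columns; features 1 and 3 carry the signal.
X = np.eye(5)[:, :4]
y = 3.0 * X[:, 1] + 2.0 * X[:, 3]
selected = sequential_selection(X, y, n_steps=2)
print(selected)
```

Because the active coefficients are unpenalized, the sketch refits them by least squares at every step, which is what keeps the selected models nested.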

Let $s$ be any subset of $S$. Denote by $X(s)$ the matrix consisting of the columns of $X$
with indices in $s$. Similarly, let $\beta(s)$ denote the vector consisting of the corresponding
components of $\beta$. Let $\mathcal{R}(s)$ be the linear space spanned by the columns of $X(s)$ and
$H(s)$ denote its projection matrix, i.e., $H(s) = X(s)[X^\top(s)X(s)]^{-1}X^\top(s)$. Some basic
properties of the sequential LASSO are given in the following.

Proposition 2.1. For $k \ge 1$ and any $l \in s_{*k}^c$, if $X(\{l\}) \in \mathcal{R}(s_{*k})$ then $l \notin s_{*k+1}$.

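Proof: If $X(\{l\}) \in \mathcal{R}(s_{*k})$ then there exists an $a_k$ such that $X(\{l\}) = X(s_{*k})a_k$ and
hence, writing $\tilde\beta(s_{*k}) = \beta(s_{*k}) + \beta_l a_k$,

$$
\begin{aligned}
l_{k+1} &= \|y - X(s_{*k})(\beta(s_{*k}) + \beta_l a_k) - X(s_{*k}^c\setminus\{l\})\beta(s_{*k}^c\setminus\{l\})\|_2^2 + \lambda|\beta_l| + \lambda\sum_{j \in s_{*k}^c\setminus\{l\}} |\beta_j| \\
&= \|y - X(s_{*k})\tilde\beta(s_{*k}) - X(s_{*k}^c\setminus\{l\})\beta(s_{*k}^c\setminus\{l\})\|_2^2 + \lambda|\beta_l| + \lambda\sum_{j \in s_{*k}^c\setminus\{l\}} |\beta_j| \\
&\ge \|y - X(s_{*k})\tilde\beta(s_{*k}) - X(s_{*k}^c\setminus\{l\})\beta(s_{*k}^c\setminus\{l\})\|_2^2 + \lambda\sum_{j \in s_{*k}^c\setminus\{l\}} |\beta_j|,
\end{aligned}
$$

with strict inequality whenever $\beta_l \neq 0$. Thus when $l_{k+1}$ is minimized there must be $\beta_l = 0$, i.e., $l \notin s_{*k+1}$. □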

Proposition 2.1 implies that, for any $k$, the matrix $X(s_{*k})$ is of full column rank. It also
suggests that, in the sequential LASSO procedure, any feature that is highly correlated
with the features selected already will have little chance to be selected subsequently.
This nature of the sequential LASSO is favorable when it is used for feature selection
in ultra-high dimensional feature space where high spurious correlations are present, see

m- 

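Proposition 2.2. For $k \ge 1$, the minimization of $l_{k+1}$ is equivalent to the minimization
of

$$\|[I - H(s_{*k})][y - X(s_{*k}^c)\beta(s_{*k}^c)]\|_2^2 + \lambda \sum_{j \in s_{*k}^c} |\beta_j|.$$

Proof: Differentiating $l_{k+1}$ with respect to $\beta(s_{*k})$, we have

$$\frac{\partial l_{k+1}}{\partial \beta(s_{*k})} = -2X^\top(s_{*k})y + 2X^\top(s_{*k})X(s_{*k})\beta(s_{*k}) + 2X^\top(s_{*k})X(s_{*k}^c)\beta(s_{*k}^c).$$

Setting the above derivative to zero, we obtain

$$\hat\beta(s_{*k}) = [X^\top(s_{*k})X(s_{*k})]^{-1}X^\top(s_{*k})[y - X(s_{*k}^c)\beta(s_{*k}^c)]. \tag{2.2}$$

Substituting (2.2) into $\|y - X\beta\|_2^2$ we have

$$\|y - X\beta\|_2^2 = \|[I - H(s_{*k})][y - X(s_{*k}^c)\beta(s_{*k}^c)]\|_2^2.$$

□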
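
The profiling step behind Proposition 2.2 can be checked numerically. The sketch below draws an arbitrary design, fixes a hypothetical active set `s` and arbitrary coefficients `beta_c` for the penalized block, solves for the unpenalized coefficients by least squares as in (2.2), and compares the resulting residual sum of squares with the projected form. All names and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 6
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
s, sc = [0, 1], [2, 3, 4, 5]            # hypothetical active set and its complement
beta_c = rng.standard_normal(len(sc))   # arbitrary coefficients of the penalized block

Xs, Xc = X[:, s], X[:, sc]
H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)   # projection matrix H(s)

# Profile out the unpenalized coefficients: least squares of y - Xc beta_c on Xs.
beta_s = np.linalg.solve(Xs.T @ Xs, Xs.T @ (y - Xc @ beta_c))

lhs = np.sum((y - Xs @ beta_s - Xc @ beta_c) ** 2)
rhs = np.sum(((np.eye(n) - H) @ (y - Xc @ beta_c)) ** 2)
print(np.isclose(lhs, rhs))
```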
As a by-product of the above proof, the components of $\hat\beta(s_{*k})$ are almost surely nonzero
since $y$ is a vector of continuous random variables. This implies that, in the sequential
LASSO, we have $s_{*1} \subset s_{*2} \subset \cdots \subset s_{*k} \subset \cdots$; that is, the models selected in the
sequential steps are nested.

For a general $k$, let $\tilde y = [I - H(s_{*k})]y$, $\tilde X = [I - H(s_{*k})]X(s_{*k}^c)$, $\tilde\beta = \beta(s_{*k}^c)$ and
$\tilde p_k = |s_{*k}^c|$. Then by Proposition 2.2 the minimization of $l_{k+1}$ is equivalent to the
minimization of

$$\tilde l_{k+1} = \|\tilde y - \tilde X\tilde\beta\|_2^2 + \lambda \sum_{j=1}^{\tilde p_k} |\tilde\beta_j|. \tag{2.3}$$

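The following proposition is the Karush-Kuhn-Tucker (KKT) condition for the solution
of the above minimization problem.

Proposition 2.3 (KKT condition). Let

$$
\partial|x| =
\begin{cases}
1, & \text{if } x > 0, \\
-1, & \text{if } x < 0, \\
r, & \text{if } x = 0,
\end{cases}
$$

where $r$ is an arbitrary number with $|r| \le 1$. Let $\partial\|\tilde\beta\|_1 = (\partial|\tilde\beta_1|, \ldots, \partial|\tilde\beta_{\tilde p_k}|)^\top$. Then
$\tilde\beta$ is a minimizer of (2.3) if

$$2\tilde X^\top(\tilde y - \tilde X\tilde\beta) = \lambda\,\partial\|\tilde\beta\|_1.$$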

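Proof: We only need to verify that the form of $\partial\|\beta\|_1$ given above is the sufficient and
necessary condition for a subgradient of $\|\beta\|_1$. First, for any $\xi$, we have

$$\|\xi\|_1 - \|\beta\|_1 = \sum_j (|\xi_j| - |\beta_j|) \ge \sum_j \partial|\beta_j|\,(\xi_j - \beta_j) = (\partial\|\beta\|_1)^\top(\xi - \beta).$$

Thus by definition $\partial\|\beta\|_1$ is a subgradient.

Next, let $w$ be any subgradient of $\|\beta\|_1$. We show that

$$
w_j =
\begin{cases}
1, & \text{if } \beta_j > 0, \\
-1, & \text{if } \beta_j < 0, \\
r, & \text{if } \beta_j = 0.
\end{cases}
$$

Suppose $\beta_j = 0$ and assume $|w_j| > 1$. Then we can define a new vector $\xi$ such that
$\xi_j = \mathrm{sign}(w_j)$ and $\xi_i = \beta_i$ for $i \neq j$. Then we have $\|\xi\|_1 - \|\beta\|_1 = 1 < w^\top(\xi - \beta) = |w_j|$,
contradicting the assumption that $w$ is a subgradient.

Now suppose $\beta_j \neq 0$. For a positive number $\delta < |\beta_j|$, define $\xi_1$ and $\xi_2$ such that
$\xi_{1j} = \beta_j + \delta\,\mathrm{sign}(\beta_j)$, $\xi_{2j} = \beta_j - \delta\,\mathrm{sign}(\beta_j)$ and $\xi_{1i} = \xi_{2i} = \beta_i$, $i \neq j$. Since $w$ is a
subgradient we must have

$$
\begin{aligned}
\|\xi_1\|_1 - \|\beta\|_1 &= \delta \ge w^\top(\xi_1 - \beta) = \delta w_j\,\mathrm{sign}(\beta_j), \\
\|\xi_2\|_1 - \|\beta\|_1 &= -\delta \ge w^\top(\xi_2 - \beta) = -\delta w_j\,\mathrm{sign}(\beta_j),
\end{aligned}
$$

which implies $w_j\,\mathrm{sign}(\beta_j) = 1$ and hence $w_j = \mathrm{sign}(\beta_j)$.

□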
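
The KKT condition can be illustrated with a small check. The sketch below assumes an orthonormal design, for which the minimizer of $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is the soft-thresholded vector $\mathrm{sign}(z_j)\max(|z_j| - \lambda/2, 0)$ with $z = X^\top y$; the helper `satisfies_kkt` and the toy numbers are hypothetical. It also shows that the all-zero vector satisfies the condition exactly when $\lambda \ge 2\max_j|X_j^\top y|$, which is why that value is the largest penalty allowing a nonzero coefficient.

```python
import numpy as np

def satisfies_kkt(X, y, beta, lam, tol=1e-10):
    """Check 2 X^T (y - X beta) = lam * (a subgradient of ||beta||_1 at beta)."""
    g = 2.0 * X.T @ (y - X @ beta)
    nz = beta != 0
    # Nonzero coordinates: gradient must equal lam * sign(beta_j).
    ok_active = np.allclose(g[nz], lam * np.sign(beta[nz]), atol=tol)
    # Zero coordinates: gradient must lie in [-lam, lam].
    ok_zero = np.all(np.abs(g[~nz]) <= lam + tol)
    return bool(ok_active and ok_zero)

# Orthonormal toy design, so the minimizer is a soft-thresholded z = X^T y.
X = np.eye(4)
y = np.array([3.0, -0.4, 1.5, 0.2])
lam = 2.0
z = X.T @ y
beta = np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

lam_crit = 2.0 * np.max(np.abs(z))   # largest penalty at which some coefficient can enter
print(satisfies_kkt(X, y, beta, lam), lam_crit)
```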

In the remainder of this section, we highlight the difference of the sequential LASSO
from FSR and OMP. First, consider the difference between the sequential LASSO and
FSR. After the sub model $s_{*k}$ is selected, the sequential LASSO selects the next feature
among the features that maximize

$$g_1(j) = |X_j^\top[I - H(s_{*k})]y|,$$

see the proof of Theorem 3.1 in §3. The FSR selects the next feature by minimizing
$\mathrm{RSS}(j) = y^\top[I - H(s_{*k} \cup \{j\})]y$, which is equivalent to maximizing

$$g_2(j) = \frac{|X_j^\top[I - H(s_{*k})]y|}{\sqrt{X_j^\top[I - H(s_{*k})]X_j}}.$$

The equivalence is established by the following identity

$$I - H(s_{*k} \cup \{j\}) = [I - H(s_{*k})]\left(I - \frac{X_jX_j^\top[I - H(s_{*k})]}{X_j^\top[I - H(s_{*k})]X_j}\right).$$

The sequential LASSO selects the next feature that has the highest correlation with
the current residual $[I - H(s_{*k})]y$ but the FSR selects the next feature that has the
highest inflated correlation with an inflating factor $\{X_j^\top[I - H(s_{*k})]X_j\}^{-1/2}$. If $X_j$ is
orthogonal to $\mathcal{R}(s_{*k})$, the factor is a constant (note that the $X_j$'s are standardized),
but larger than the constant otherwise. The more correlated $X_j$ is with the features
in $s_{*k}$, the larger the inflating factor. If two features have the same absolute correlation
with the current residual, the FSR will select the one that is more correlated with the
features in $s_{*k}$. If one feature has a lower correlation with the current residual but
is more correlated with the features in $s_{*k}$ than another feature, it might turn out
that this feature has a higher inflated correlation and is selected by FSR. Obviously,
this is a disadvantage of FSR, especially when high spurious correlations are present in
small-n-large-p problems.
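
A toy calculation makes the effect of the inflating factor concrete. The three unit-norm columns below are a hypothetical example (not centered, for brevity): feature 1 is orthogonal to the active feature 0, while feature 2 is strongly correlated with it. The helper `criteria` is an illustrative name; it evaluates both criteria for the candidate features.

```python
import numpy as np

def criteria(X, y, active, candidates):
    """Return g1 (plain) and g2 (inflated) criteria for the candidate features."""
    Xs = X[:, active]
    H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
    M = np.eye(X.shape[0]) - H                       # I - H(s)
    g1 = np.array([abs(X[:, j] @ M @ y) for j in candidates])
    infl = np.array([1.0 / np.sqrt(X[:, j] @ M @ X[:, j]) for j in candidates])
    return g1, g1 * infl

a = np.sqrt(0.75)
X = np.array([[1.0, 0.0, a],
              [0.0, 0.6, 0.5],
              [0.0, 0.8, 0.0]])
y = np.array([2.0, 1.0, 0.0])

g1, g2 = criteria(X, y, active=[0], candidates=[1, 2])
print(g1, g2)
```

Here the plain criterion prefers feature 1, whereas the inflated criterion prefers feature 2, the one more correlated with the already selected feature.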

The OMP selects the next feature (or features) maximizing $g_1(j)$. At steps where
there is only one feature that maximizes $g_1(j)$, the sequential LASSO and the OMP
select the same next feature. But at steps where more than one feature
maximizes $g_1(j)$, there is a difference between the sequential LASSO and the OMP. The
OMP selects all those features. But the sequential LASSO selects them all subject to
a partial positive cone condition, see the proof of Theorem 3.1. If the partial positive
cone condition is not satisfied, the sequential LASSO generally does not select all
those features. The sequential LASSO can be easily extended as a sequential penalized
likelihood method for generalized linear models but there is no obvious way by which the
OMP can be extended. We will explore the properties of the extended sequential penalized
likelihood method in our future research.

3 Selection consistency of sequential LASSO with ultra-high dimensional feature space

We establish in this section the selection consistency of the sequential LASSO when
the dimension of the feature space is ultra-high, i.e., $\ln p_n = O(n^{\kappa})$, $\kappa > 0$, under
two different settings of the feature matrix $X$: (i) $X$ is deterministic and (ii) $X$ is
random. The deterministic case is dealt with in §3.1 and the random case in §3.2.
Some interesting special cases are discussed in §3.3. The sequential LASSO with EBIC
as the stopping rule is considered in §3.4.



3.1 The case of deterministic feature matrix 

In the deterministic case, the columns of $X$ are normalized such that the sample mean
and variance of each feature are 0 and $n$ respectively. We now introduce some notations.
For $s \subset S$, let $s^- = s^c \cap s_0$. Recall that $s_0$ is the set of indices of the nonzero $\beta_j$'s. If
$s \subset s_0$ then $s^-$ is the complement of $s$ in $s_0$. For $s \subset s_0$, define

$$\gamma_n(j, s, \beta) = \frac{1}{n}X_j^\top[I - H(s)]X\beta.$$

In fact, $\gamma_n(j, s, \beta)$ only depends on $\beta(s^c)$. But for the ease of notation, $\beta$ and $\beta(s^c)$
will be used interchangeably. Unless otherwise stated, $\beta$ also denotes the unknown true
value of the parameter vector. The selection consistency of the sequential LASSO in
the case of deterministic feature matrix is established under the following assumptions.

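A1 $\max_{j \in s_0^c} |\gamma_n(j, s, \beta)| < q \max_{j \in s^-} |\gamma_n(j, s, \beta)|$, $0 < q < 1$.

A2 (Partial positive cone condition). Let

$$A_s = \{j : j \in s^c,\ |\gamma_n(j, s, \beta)| = \max_{l \in s^c} |\gamma_n(l, s, \beta)|\}$$

and $\tilde X(A_s) = [I - H(s)]X(A_s)$. Then $[\tilde X^\top(A_s)\tilde X(A_s)]^{-1}\mathbf{1} > 0$, where $\mathbf{1}$ is the
vector with all components 1.

A3 $\dfrac{\sqrt{n}}{\ln p_n}\,\lambda_{\min}\!\left[\dfrac{1}{n}X^\top(s_0)X(s_0)\right]\min_{j \in s_0} |\beta_j| \to +\infty$ as $n \to \infty$, where $\lambda_{\min}$ denotes the smallest
eigenvalue.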
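Assumption A1 is implied by the following condition

$$\|\tilde X_j^\top\tilde X(s^-)[\tilde X^\top(s^-)\tilde X(s^-)]^{-1}\|_1 < 1 - \eta, \quad \forall j \in s_0^c, \tag{3.1}$$

where $\tilde X_j = [I - H(s)]X_j$ and $0 < \eta < 1$. The claim above follows because, with $\mu = X\beta$,

$$
\begin{aligned}
|\gamma_n(j, s, \beta)| &= \frac{1}{n}|X_j^\top[I - H(s)]\mu| \\
&= \frac{1}{n}|\tilde X_j^\top\tilde X(s^-)[\tilde X^\top(s^-)\tilde X(s^-)]^{-1}X^\top(s^-)[I - H(s)]\mu| \\
&\le \|\tilde X_j^\top\tilde X(s^-)[\tilde X^\top(s^-)\tilde X(s^-)]^{-1}\|_1 \cdot \frac{1}{n}\|X^\top(s^-)[I - H(s)]\mu\|_\infty \\
&< (1 - \eta)\frac{1}{n}\|X^\top(s^-)[I - H(s)]\mu\|_\infty = (1 - \eta)\max_{j \in s^-}\frac{1}{n}|X_j^\top[I - H(s)]\mu| \\
&= (1 - \eta)\max_{j \in s^-}|\gamma_n(j, s, \beta)|,
\end{aligned}
$$

where the strict inequality holds by (3.1).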

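Under assumption A1, the $A_s$ in A2 is a subset of $s_0$. Assumption A2 holds if and
only if

$$\tilde X_j^\top\tilde X(A_s\setminus\{j\})[\tilde X^\top(A_s\setminus\{j\})\tilde X(A_s\setminus\{j\})]^{-1}\mathbf{1} < 1, \quad \forall j \in A_s. \tag{3.2}$$

We establish the equivalence of A2 and (3.2) below. Let $A = \tilde X(A_s\setminus\{j\})$ and $b = \tilde X_j$.
Since a permutation of the rows and columns does not change the sum of the rows,
it suffices to verify that the sum of the last row of
$\begin{pmatrix} A^\top A & A^\top b \\ b^\top A & b^\top b \end{pmatrix}^{-1}$ is positive if and
only if $b^\top A(A^\top A)^{-1}\mathbf{1} < 1$. Let $E = I - A(A^\top A)^{-1}A^\top$ and $F = I - b(b^\top b)^{-1}b^\top$. By the
formula for the inverse of blocked matrices, we have

$$
\begin{pmatrix} A^\top A & A^\top b \\ b^\top A & b^\top b \end{pmatrix}^{-1}
=
\begin{pmatrix}
(A^\top FA)^{-1} & -(A^\top A)^{-1}A^\top b(b^\top Eb)^{-1} \\
-(b^\top b)^{-1}b^\top A(A^\top FA)^{-1} & (b^\top Eb)^{-1}
\end{pmatrix}
$$

and

$$
\begin{aligned}
(A^\top FA)^{-1} &= [A^\top A - A^\top b(b^\top b)^{-1}b^\top A]^{-1} \\
&= (A^\top A)^{-1} + (A^\top A)^{-1}A^\top b(b^\top Eb)^{-1}b^\top A(A^\top A)^{-1}.
\end{aligned}
$$

Substituting the expression of $(A^\top FA)^{-1}$ into the first block of the last row of the above
matrix, we obtain

$$-(b^\top b)^{-1}b^\top A(A^\top FA)^{-1} = -(b^\top Eb)^{-1}b^\top A(A^\top A)^{-1}.$$

Thus the sum of the last row becomes

$$(b^\top Eb)^{-1} - (b^\top Eb)^{-1}b^\top A(A^\top A)^{-1}\mathbf{1} = (b^\top Eb)^{-1}[1 - b^\top A(A^\top A)^{-1}\mathbf{1}],$$

which is greater than 0 if and only if $b^\top A(A^\top A)^{-1}\mathbf{1} < 1$.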
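
The row-sum formula used in this argument is easy to confirm numerically. The sketch below builds the bordered Gram matrix from a random matrix `A` and vector `b` (illustrative stand-ins for the projected feature columns), inverts it directly, and compares the sum of its last row with the closed form $(b^\top Eb)^{-1}[1 - b^\top A(A^\top A)^{-1}\mathbf{1}]$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 4
A = rng.standard_normal((n, m))   # stand-in for the other columns of the maximizing set
b = rng.standard_normal(n)        # stand-in for the column of feature j
ones = np.ones(m)

# Bordered Gram matrix [[A'A, A'b], [b'A, b'b]] and the sum of the last row of its inverse.
G = np.block([[A.T @ A, (A.T @ b)[:, None]],
              [(b @ A)[None, :], np.array([[b @ b]])]])
last_row_sum = np.linalg.inv(G)[-1].sum()

# Closed form from the blocked-inverse formula, with E = I - A (A'A)^{-1} A'.
E = np.eye(n) - A @ np.linalg.solve(A.T @ A, A.T)
closed_form = (1.0 - b @ A @ np.linalg.solve(A.T @ A, ones)) / (b @ E @ b)
print(np.isclose(last_row_sum, closed_form))
```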

Condition (3.1) is a conditional version of ERC conditioning on the subset $s$ of the
relevant features. Condition (3.2) is similar to but much weaker than the irrepresentable
condition. The above arguments suggest that Conditions A1 and A2 might be weaker
than the ERC and the irrepresentable condition. This is indeed the case. We will
demonstrate this by special cases in §3.3 where the conditions for the selection consistency
of the sequential LASSO hold but the ERC and the irrepresentable condition are
not satisfied. If $\lambda_{\min}\!\left(\frac{1}{n}X^\top(s_0)X(s_0)\right)$ is bounded away from zero, which is a common
assumption in the case of ultra-high dimensional feature space, then Condition A3 is
equivalent to $\frac{\sqrt{n}}{\ln p_n}\min_{j \in s_0}|\beta_j| \to \infty$. If $\ln p_n = O(n^{\kappa})$ with $\kappa < 1/2$ and $\min_{j \in s_0}|\beta_j| \ge Cn^{-\delta}$
for some constant $C$ and $\delta < 1/2 - \kappa$, A3 is then satisfied.

We now state and prove the major theorem in the following. 

Theorem 3.1. Suppose that assumptions A1-A3 hold. Let $\ln p_n = O(n^{\kappa})$, where $\kappa < 1/2$.
Then the sequential LASSO is selection consistent in the sense that

$$\Pr(s_{*k^*} = s_0) \to 1, \quad \text{as } n \to \infty,$$

where $s_{*k^*}$ is the set of features selected at the $k^*$th step of the sequential LASSO such
that $|s_{*k^*}| = p_0$, $s_0$ is the set of relevant features and $p_0 = |s_0|$.

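Proof. By Proposition 2.3, at the $(k+1)$st step of the sequential LASSO, the solution
satisfies

$$2\tilde X^\top(\tilde y - \tilde X\tilde\beta) = \lambda\,\partial\|\tilde\beta\|_1, \tag{3.3}$$

where $\tilde y = [I - H(s_{*k})]y$, $\tilde X = [I - H(s_{*k})]X(s_{*k}^c)$, and $\partial\|\tilde\beta\|_1$ is a subgradient of $\|\beta\|_1$
at $\tilde\beta$ whose components are 1, $-1$ or a number with absolute value less than or equal
to 1 according as the components of $\tilde\beta$ are positive, negative or zero. For $k = 0$, $s_{*0}$ is taken
as the empty set $\emptyset$. Obviously, $s_{*0} \subset s_0$. Assume that $s_{*k} \subset s_0$ and $|s_{*k}| < p_0$. Let

$$\tilde\gamma_n(j, s_{*k}, \beta) = \frac{1}{n}X_j^\top[I - H(s_{*k})]y = \gamma_n(j, s_{*k}, \beta) + \frac{1}{n}X_j^\top[I - H(s_{*k})]\epsilon.$$

Define

$$A_k = \{j : |\tilde\gamma_n(j, s_{*k}, \beta)| = \max_{l \in s_{*k}^c}|\tilde\gamma_n(l, s_{*k}, \beta)|\}.$$

We are going to show that, with probability converging to 1, $A_k \subset s_0$ and that $A_k$
is the set of non-zero elements of the solution to equation (3.3). We first show that
$A_k \subset s_0$, which is implied by $\max_{j \in s_{*k}^-}|\tilde\gamma_n(j, s_{*k}, \beta)| > \max_{l \in s_0^c}|\tilde\gamma_n(l, s_{*k}, \beta)|$ with
probability converging to 1. The statement is established by showing

(i) $\frac{1}{n}X_j^\top[I - H(s_{*k})]\epsilon = O_p(n^{-1/2}\ln p_n)$ uniformly for all $j \in s_{*k}^c$.

(ii) $\max_{j \in s_{*k}^-}|\gamma_n(j, s_{*k}, \beta)| \ge C_n n^{-1/2}\ln p_n$ for $C_n \to \infty$.

Notice that $X_j^\top[I - H(s_{*k})]\epsilon \sim N(0, \sigma^2\|\tilde X_j\|_2^2)$ where $\|\tilde X_j\|_2^2 \le \|X_j\|_2^2 = n$. Hence

$$
\begin{aligned}
P\!\left(\frac{1}{n}|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{-1/2}\ln p_n\right)
&= P\!\left(|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{1/2}\ln p_n\right) \\
&\le P\!\left(|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma\|\tilde X_j\|_2\ln p_n\right) \\
&= P(|z| > \ln p_n) \le \frac{2}{\ln p_n}\exp\!\left\{-\frac{(\ln p_n)^2}{2}\right\},
\end{aligned}
$$

where $z$ is a standard normal random variable. Thus, by Bonferroni inequality,

$$P\!\left(\max_{j \in s_{*k}^c}\frac{1}{n}|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{-1/2}\ln p_n\right) \le \frac{2}{\ln p_n}\exp\!\left\{-\frac{(\ln p_n)^2}{2} + \ln p_n\right\} \to 0. \tag{3.4}$$

Thus (i) is proved.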
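
The Gaussian tail step can be sanity-checked numerically. The sketch below assumes the bound takes the Mills-ratio form $P(|z| > t) \le (2/t)\exp(-t^2/2)$ for $t > 0$ (an assumption about the exact constant used in the proof) and compares it with the exact two-sided tail computed from the complementary error function. The function names are hypothetical.

```python
import math

def two_sided_normal_tail(t):
    """Exact P(|z| > t) for a standard normal z."""
    return math.erfc(t / math.sqrt(2.0))

def tail_bound(t):
    """Assumed Mills-ratio type bound (2 / t) * exp(-t^2 / 2), valid for t > 0."""
    return (2.0 / t) * math.exp(-t * t / 2.0)

# Each tuple holds (t, exact tail probability, bound).
checks = [(t, two_sided_normal_tail(t), tail_bound(t)) for t in (1.0, 2.0, 3.0, 5.0)]
for t, exact, bound in checks:
    print(t, exact, bound)
```

With $t = \ln p_n$ the bound decays faster than any power of $p_n$, which is what lets the union over up to $p_n$ features still vanish.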

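Let $\Delta(s_{*k}) = \mu^\top[I - H(s_{*k})]\mu$ where $\mu = X\beta$. We have the following inequalities

$$\frac{1}{n}\Delta(s_{*k}) = \sum_{j \in s_{*k}^-}\beta_j\frac{1}{n}X_j^\top[I - H(s_{*k})]\mu \le \|\beta(s_{*k}^-)\|_1\max_{j \in s_{*k}^-}|\gamma_n(j, s_{*k}, \beta)|, \tag{3.5}$$

and

$$
\begin{aligned}
\frac{1}{n}\Delta(s_{*k}) &= \beta^\top(s_{*k}^-)\frac{1}{n}X^\top(s_{*k}^-)[I - H(s_{*k})]X(s_{*k}^-)\beta(s_{*k}^-) \\
&\ge \lambda_{\min}\!\left(\frac{1}{n}X^\top(s_{*k}^-)[I - H(s_{*k})]X(s_{*k}^-)\right)\|\beta(s_{*k}^-)\|_2^2 \\
&\ge \lambda_{\min}\!\left(\frac{1}{n}X^\top(s_0)X(s_0)\right)\|\beta(s_{*k}^-)\|_2^2.
\end{aligned}
\tag{3.6}
$$

The second inequality above follows since $s_{*k} \cup s_{*k}^- = s_0$ and $\{X^\top(s_{*k}^-)[I - H(s_{*k})]X(s_{*k}^-)\}^{-1}$
is a sub-matrix of $[X^\top(s_0)X(s_0)]^{-1}$ by the formula of the inverse of blocked matrices.
Combining (3.5) and (3.6) yields

$$
\begin{aligned}
\max_{j \in s_{*k}^-}|\gamma_n(j, s_{*k}, \beta)| &\ge \lambda_{\min}\!\left(\frac{1}{n}X^\top(s_0)X(s_0)\right)\frac{\|\beta(s_{*k}^-)\|_2^2}{\|\beta(s_{*k}^-)\|_1} \\
&\ge \lambda_{\min}\!\left(\frac{1}{n}X^\top(s_0)X(s_0)\right)\min_{j \in s_0}|\beta_j| \\
&= C_n n^{-1/2}\ln p_n, \text{ say,}
\end{aligned}
$$

with $C_n = \frac{n^{1/2}}{\ln p_n}\lambda_{\min}\!\left(\frac{1}{n}X^\top(s_0)X(s_0)\right)\min_{j \in s_0}|\beta_j|$. The second inequality above holds
since $|s_{*k}^-|\,\|\beta(s_{*k}^-)\|_2^2 \ge \|\beta(s_{*k}^-)\|_1^2 \ge |s_{*k}^-|\min_{j \in s_0}|\beta_j|\,\|\beta(s_{*k}^-)\|_1$. $C_n \to \infty$ by A3. Thus
(ii) is proved.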

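By A1 and (ii),

$$
\begin{aligned}
\Big|\max_{j \in s_{*k}^-}|\gamma_n(j, s_{*k}, \beta)| - \max_{j \in s_0^c}|\gamma_n(j, s_{*k}, \beta)|\Big|
&> (1 - q)\max_{j \in s_{*k}^-}|\gamma_n(j, s_{*k}, \beta)| \\
&\ge (1 - q)C_n n^{-1/2}\ln p_n.
\end{aligned}
$$

This fact and (i) then imply that $\tilde\gamma_n(j, s_{*k}, \beta)$ must attain the maximum within $s_{*k}^-$.
Therefore, $A_k \subset s_{*k}^- \subset s_0$.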

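Without loss of generality, assume that $\tilde\gamma_n(j, s_{*k}, \beta) > 0$ for all $j \in A_k$. Consider
$\tilde\gamma_n(j, s_{*k}, \xi)$ as a function of $\xi$. Since the function is continuous, for each $j \in A_k$, there
exist a neighborhood $\mathcal{N}_j = \{\xi : \|\xi - \beta\|_2 \le \delta_j\}$ and a constant $c_j > 0$ such that, for all
$\xi \in \mathcal{N}_j$, $\tilde\gamma_n(j, s_{*k}, \xi) - \max_{l \in A_k^c}|\tilde\gamma_n(l, s_{*k}, \xi)| > c_j$. Here $A_k^c$ denotes the complement of
$A_k$ in $s_{*k}^c$ by an abuse of notation. Let $\mathcal{N} = \{\xi : \|\xi - \beta\|_2 \le \delta\}$ where $\delta = \min_j\delta_j$. Then
for all $\xi \in \mathcal{N}$, $\min_{j \in A_k}\tilde\gamma_n(j, s_{*k}, \xi) - \max_{l \in A_k^c}|\tilde\gamma_n(l, s_{*k}, \xi)| > c$, where $c = \min_j c_j$.

Now construct $\hat\beta$ as follows. Let $\hat\beta(A_k) = \omega[\tilde X^\top(A_k)\tilde X(A_k)]^{-1}\mathbf{1}$ and $\hat\beta(A_k^c) = 0$,
where $\omega > 0$. By A2, $\hat\beta(A_k) > 0$. Take $\omega$ small enough such that $\beta - \hat\beta \in \mathcal{N}$. Thus we
have $\min_{j \in A_k}\tilde\gamma_n(j, s_{*k}, \beta - \hat\beta) > \max_{l \in A_k^c}|\tilde\gamma_n(l, s_{*k}, \beta - \hat\beta)|$. On the other hand, for
any $j \in A_k$,

$$
\begin{aligned}
\tilde\gamma_n(j, s_{*k}, \beta - \hat\beta) &= \max_{l \in s_{*k}^c}\tilde\gamma_n(l, s_{*k}, \beta) - \omega\frac{1}{n}\tilde X_j^\top\tilde X(A_k)[\tilde X^\top(A_k)\tilde X(A_k)]^{-1}\mathbf{1} \\
&= \max_{l \in s_{*k}^c}\tilde\gamma_n(l, s_{*k}, \beta) - \frac{\omega}{n}.
\end{aligned}
$$

Let $\lambda = 2n[\max_{j \in s_{*k}^c}\tilde\gamma_n(j, s_{*k}, \beta) - \frac{\omega}{n}]$. Then, we have

$$
\begin{aligned}
2\tilde X_j^\top(\tilde y - \tilde X\hat\beta) &= \lambda, && \text{for } j \in A_k, \\
|2\tilde X_j^\top(\tilde y - \tilde X\hat\beta)| &< \lambda, && \text{for } j \notin A_k.
\end{aligned}
$$

Let $\partial|\hat\beta_j| = 2\tilde X_j^\top(\tilde y - \tilde X\hat\beta)/\lambda$ for $j \notin A_k$, and 1 for $j \in A_k$. Then $\partial\|\hat\beta\|_1$ with these
components is a subgradient of $\|\beta\|_1$ at $\hat\beta$ and $\hat\beta$ solves equation (3.3). From the
construction of $\hat\beta$, all the features corresponding to the non-zero components of $\hat\beta$
belong to $s_0$. Hence $s_{*k+1} \subset s_0$. Thus we have shown that, given $s_{*k} \subset s_0$, $s_{*k+1} \subset s_0$
with probability converging to 1.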

If $p_0$ is bounded then we have already established the selection consistency of the
sequential LASSO. If $p_0$ diverges as $n \to \infty$, we need to show that $s_{*k} \subset s_0$, $k = 1, \ldots, p_0$,
simultaneously, with probability converging to 1. Note that, under the assumptions,
$s_{*k+1} \subset s_0$ is equivalent to $\min_{j \in A_k}\tilde\gamma_n(j, s_{*k}, \beta) > \max_{l \in A_k^c}|\tilde\gamma_n(l, s_{*k}, \beta)|$, which is
implied by $P(\max_{j \in s_{*k}^c}\frac{1}{n}|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{-1/2}\ln p_n) \to 0$. Therefore, when $p_0$ is
divergent, the selection consistency is established if

$$P\!\left(\max_{0 \le k < p_0}\max_{j \in s_{*k}^c}\frac{1}{n}|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{-1/2}\ln p_n\right) \to 0, \quad \text{as } n \to \infty.$$

It follows from (3.4) and the Bonferroni inequality that

$$P\!\left(\max_{0 \le k < p_0}\max_{j \in s_{*k}^c}\frac{1}{n}|X_j^\top[I - H(s_{*k})]\epsilon| > \sigma n^{-1/2}\ln p_n\right) \le \frac{2}{\ln p_n}\exp\!\left\{-\frac{(\ln p_n)^2}{2} + 2\ln p_n\right\} \to 0,$$

since $p_0 < p_n$. The proof is completed. □

3.2 The case of random feature matrix 

Instead of considering $X$ as a fixed design matrix, we now assume $X_i = (x_{i1}, \ldots, x_{ip_n})^\top$,
$i = 1, \ldots, n$, are i.i.d. copies of a random vector $z = (z_1, \ldots, z_{p_n})^\top$. Without loss
of generality, assume that $Ez = 0$ and $\mathrm{Var}(z) = \Sigma$ with diagonal elements 1 and
off-diagonal elements independent of $n$. Assume that

a1 The off-diagonal elements of $\Sigma$ are bounded by a constant less than 1; that is,
the correlation between any two features is bounded away from $-1$ and from 1.

a2 $\sigma_{\max} = \max_{1 \le j,k \le p_n}\sigma(z_jz_k) < \infty$, where $\sigma(z_jz_k)$ denotes the standard deviation of
$z_jz_k$.

a3 $\max_{1 \le j,k \le p_n}E\exp(tz_jz_k)$ and $\max_{1 \le j \le p_n}E\exp(tz_j\epsilon)$ are finite for $t$ in a neighborhood
of zero.

For any $s, \tilde s \subset S$, denote by $\Sigma_{s\tilde s}$ the sub matrix of $\Sigma$ with row indices in $s$ and column
indices in $\tilde s$. Define

$$\Gamma(j, s, \beta) = (\Sigma_{jS} - \Sigma_{js}\Sigma_{ss}^{-1}\Sigma_{sS})\beta.$$

The following assumptions are imposed:

A1' For any $s \subset s_0$, $s \neq s_0$, $\max_{j \in s_0^c}|\Gamma(j, s, \beta)| < \max_{j \in s^-}|\Gamma(j, s, \beta)|$.

A2' Let $A_s = \{j : j \in s^c, |\Gamma(j, s, \beta)| = \max_{l \in s^c}|\Gamma(l, s, \beta)|\}$. Then

$$(\Sigma_{A_sA_s} - \Sigma_{A_ss}\Sigma_{ss}^{-1}\Sigma_{sA_s})^{-1}\mathbf{1} > 0.$$

A3' $\dfrac{n^{1/2}}{\ln p_n}\lambda_{\min}(\Sigma_{s_0s_0})\min_{j \in s_0}|\beta_j| \to +\infty$ as $n \to +\infty$.

The assumptions A1'-A3' are in fact the assumptions A1-A3 with the empirical
variances and covariances of the features replaced by their theoretical counterparts.
In order to establish the selection consistency of the sequential LASSO in the case of
random feature matrix, we need to pass from assumptions A1'-A3' to assumptions
A1-A3. The following lemma ensures that if A1'-A3' hold then A1-A3 hold with
probability converging to 1 as $n$ goes to infinity.

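Lemma 3.1. Under assumptions a1-a3,

(i) $P\!\left(\max_{1 \le j,k \le p_n}\left|\frac{1}{n}\sum_{i=1}^{n}x_{ij}x_{ik} - \Sigma_{jk}\right| > n^{-1/3}\sigma_{\max}\right) \to 0$.

(ii) $P\!\left(\max_{1 \le j \le p_n}\left|\frac{1}{n}\sum_{i=1}^{n}x_{ij}\epsilon_i\right| > n^{-1/3}\sigma\right) \to 0$.

(iii) Let $\Sigma_{jl|s} = \Sigma_{jl} - \Sigma_{js}\Sigma_{ss}^{-1}\Sigma_{sl}$ and $\hat\Sigma_{jl|s} = X_j^\top[I - H(s)]X_l/n$. Then

$$\max_{1 \le j,l \le p_n}\ \max_{s : |s| \le p_0}|\hat\Sigma_{jl|s} - \Sigma_{jl|s}| = o_p(1).$$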

l<j,'<PnS:|s|<po 

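Proof. For any $j, k \in \{1, 2, \ldots, p_n\}$ it follows from [7] that

$$P\!\left(\Big|\sum_{i=1}^{n}x_{ij}x_{ik} - n\Sigma_{jk}\Big| > \sqrt{n}\,\sigma(z_jz_k)\psi_n\right) \le C[1 - \Phi(\psi_n)]\exp\!\left[\frac{\psi_n^3}{\sqrt{n}}\lambda\!\left(\frac{\psi_n}{\sqrt{n}}\right)\right], \tag{3.7}$$

where $C$ is a constant, $\Phi(\cdot)$ is the cumulative distribution function of the standard normal
distribution, $\lambda(\cdot)$ is the Cramer series for the distribution of $z_jz_k$ which converges in a
neighborhood of zero under assumption a3, and $\psi_n$ is a sequence satisfying $\psi_n = o(n^{1/2})$
and $\psi_n \to \infty$.

Now take $\psi_n = n^{1/6 - \delta}$ for $0 < \delta < \frac{1}{6} - \frac{\kappa}{2}$. Then $\lambda(\psi_n/\sqrt{n})$ is bounded and $\psi_n^3/\sqrt{n}$ goes to
0 as $n$ converges to $\infty$. Thus (3.7) leads to

$$
\begin{aligned}
P\!\left(\Big|\sum_{i=1}^{n}x_{ij}x_{ik} - n\Sigma_{jk}\Big| > n^{2/3}\sigma_{\max}\right)
&\le P\!\left(\Big|\sum_{i=1}^{n}x_{ij}x_{ik} - n\Sigma_{jk}\Big| > n^{2/3 - \delta}\sigma(z_jz_k)\right) \\
&\le C_1 n^{-1/6 + \delta}\exp\!\left(-\frac{1}{2}n^{1/3 - 2\delta}\right),
\end{aligned}
$$

where $C_1$ is a generic constant. Let $p_n = \exp(an^{\kappa})$ where $a > 0$ and $\kappa < 1/3$. By
Bonferroni inequality,

$$P\!\left(\max_{1 \le j,k \le p_n}\Big|\sum_{i=1}^{n}x_{ij}x_{ik} - n\Sigma_{jk}\Big| > n^{2/3}\sigma_{\max}\right) = o(n^{-1/6 + \delta}) \to 0.$$

Hence (i) is proved. The proof of (ii) is similar and is omitted.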

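Note that, for $X_j$, $X_l$ and $X(s)$, $\frac{1}{n}X_j^\top(I - X(s)[X^\top(s)X(s)]^{-1}X^\top(s))X_l$ is a continuous
function of the means $\frac{1}{n}\sum_{i=1}^{n}x_{ij}x_{il}$, $\frac{1}{n}\sum_{i=1}^{n}x_{ij}x_{ik}$, $\frac{1}{n}\sum_{i=1}^{n}x_{il}x_{ik}$ and $\frac{1}{n}\sum_{i=1}^{n}x_{ik}x_{im}$,
$k, m \in s$. Let $\bar X_{jls}$ denote the vector consisting of these means and $\mu_{jls}$ its expectation.
The function depends on $|s|$ but not on $n$. Let $g_{|s|}(\bar X_{jls})$ denote this function. We then
have $g_{|s|}(\mu_{jls}) = \Sigma_{jl|s}$.

By assumption a1, the range of $\mu_{jls}$ for all $j, l, s$ with fixed $|s|$ is compact. Hence
$g_{|s|}$ is also uniformly continuous for all $(j, l, s)$ with fixed $|s|$. Thus for any $\eta > 0$ there
is a $\zeta > 0$ such that if $\|\bar X_{jls} - \mu_{jls}\|_\infty \le \zeta$ then $|g_{|s|}(\bar X_{jls}) - g_{|s|}(\mu_{jls})| \le \eta$, where $\zeta$
does not depend on $(j, l, s)$. From the proof of (i), we can choose an $n_0$ such that when
$n > n_0$,

$$P\!\left(\max_{1 \le j,k \le p_n}\Big|\frac{1}{n}\sum_{i=1}^{n}x_{ij}x_{ik} - \Sigma_{jk}\Big| > \zeta\right) = o(n^{-1/6 + \delta}).$$

Thus, for each fixed $|s|$, we have

$$P\!\left(\max_{j,l}\ \max_{s\text{ with fixed }|s|}|g_{|s|}(\bar X_{jls}) - g_{|s|}(\mu_{jls})| > \eta\right) = o(n^{-1/6 + \delta}).$$

By Bonferroni inequality,

$$P\!\left(\max_{j,l}\ \max_{s : |s| \le p_0}|g_{|s|}(\bar X_{jls}) - g_{|s|}(\mu_{jls})| > \eta\right) \le o(n^{-1/6 + \delta})\,p_0 \to 0,$$

for $p_0 = O(n^{1/6 - \delta})$. (iii) is proved. □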
Theorem 3.2. Let $\ln p_n = O(n^{\kappa})$, $\kappa < 1/3$, and $p_0 = O(n^{c})$, $\kappa/2 < c < 1/6$. The
sequential LASSO is selection consistent with random feature matrices that satisfy
conditions a1-a3 and A1'-A3'.

The theorem is in fact a corollary of Lemma 3.1. It follows from the lemma immediately
that if a1-a3 and A1'-A3' are satisfied then A1-A3 hold with probability
converging to 1. Thus the selection consistency of the sequential LASSO with random
feature matrix is established.

3.3 Special cases 

In this subsection, we provide two special cases where the conditions for the selection
consistency of the sequential LASSO can be directly verified. The first special
case concerns constant positive correlation among the features. In this case, for the
irrepresentable condition to be satisfied, some restriction must be imposed. But such a
restriction is not needed for sequential LASSO. The second special case deals with a
correlation structure under which the irrepresentable condition is violated.

Special case I: Let the correlation matrix of $z$ be given by

$$\Sigma = (1 - \rho)I + \rho\mathbf{1}\mathbf{1}^\top,$$

where $I$ is the identity matrix of dimension $p_n$, $\mathbf{1}$ is a $p_n$-vector of all elements 1, and
$0 < \rho < \rho_0 < 1$. Note that $\rho$ is allowed to depend on $n$. But for the ease of notation we
do not make this dependence explicit. In this case, the assumptions A1'-A3' are satisfied
with miuj-gsQ \/3j\ = Cn~^''^~^ for some constant C and an arbitrarily small positive 6. 
The claim is verified in the following.

For any $s \subset S$, the sub correlation matrix $\Sigma_{ss}$ has eigenvalues $1 - \rho$ and $1 + (|s| - 1)\rho$
with multiplicities $|s| - 1$ and 1 respectively. The eigenvector corresponding to
$1 + (|s| - 1)\rho$ is $\mathbf{1}$ with dimension $|s|$. The smallest eigenvalue is $1 - \rho$. Thus A3' follows
immediately.
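
The eigenvalue claim for the constant-correlation matrix can be confirmed directly. In the sketch below the dimension `m` and correlation `rho` are arbitrary illustrative values, and `equicorrelation` is a hypothetical helper name.

```python
import numpy as np

def equicorrelation(m, rho):
    """Return the m x m matrix (1 - rho) I + rho 1 1^T."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

m, rho = 5, 0.3
eig = np.sort(np.linalg.eigvalsh(equicorrelation(m, rho)))

# The vector of ones should be an eigenvector with eigenvalue 1 + (m - 1) rho.
ones = np.ones(m)
top = equicorrelation(m, rho) @ ones
print(eig, top)
```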
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Now suppose s C sq. For any j, k £ s'^, we have 

2| 



l + i\s\-l)p 
{ ( l-p){p\s\ + X) _ 

l + (|s|-l)p -"' '^^-^ 

'^'-'^ -b, ifj^k. 



[ l + {\s\-l)p 



Therefore, 



sk) 
kGs- 

ia-b)^,+bZkes-l^k = bZkes-Pk + il-p)l3,, foriGs", 
bJ2kes-f^k, forjGsg. 



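Thus

$$\max_{j \in s^-}|\gamma_n(j, s, \beta)| =
\begin{cases}
\left|b\sum_{k \in s^-}\beta_k + (1-\rho)\max_{j \in s^-}\beta_j\right| & \text{if } \sum_{k \in s^-}\beta_k > 0, \\
\left|b\sum_{k \in s^-}\beta_k + (1-\rho)\min_{j \in s^-}\beta_j\right| & \text{if } \sum_{k \in s^-}\beta_k < 0.
\end{cases}$$

Obviously, $\max_{j \in s^-}|\gamma_n(j, s, \beta)| > \max_{j \in s_0^c}|\gamma_n(j, s, \beta)|$
and hence A1 is satisfied. Finally, we have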



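$$\Sigma_{A_sA_s} - \Sigma_{A_ss}\Sigma_{ss}^{-1}\Sigma_{sA_s}
= (1-\rho)I + \rho\mathbf{1}\mathbf{1}^{\tau} - \rho^{2}\mathbf{1}\mathbf{1}^{\tau}\Sigma_{ss}^{-1}\mathbf{1}\mathbf{1}^{\tau}
= (1-\rho)I + \frac{\rho(1-\rho)}{1+(|s|-1)\rho}\mathbf{1}\mathbf{1}^{\tau}.$$

Let $\nu$ be the number of elements in $A_s$. The eigenvalue of the above matrix
corresponding to the eigenvector $\mathbf{1}$ is

$$(1-\rho) + \frac{\nu\rho(1-\rho)}{1+(|s|-1)\rho} = \frac{(1-\rho)\,[1+(|s|+\nu-1)\rho]}{1+(|s|-1)\rho}.$$

Hence

$$\left(\Sigma_{A_sA_s} - \Sigma_{A_ss}\Sigma_{ss}^{-1}\Sigma_{sA_s}\right)^{-1}\mathbf{1}
= \frac{1+(|s|-1)\rho}{(1-\rho)\,[1+(|s|+\nu-1)\rho]}\,\mathbf{1} > 0,$$

i.e., A2 holds.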

Note that, in the above argument, we only need $\rho = \rho_n \le \rho_0 < 1$. But,
for the irrepresentable condition to hold, the following restriction must be in place:

$$\rho_n < \frac{1}{1 + c|s_0|}$$

for some constant $c$, see [17]. If $|s_0| \to \infty$, $\rho_n$ must go to zero, i.e.,
eventually, all the features must be statistically uncorrelated.

Special case II. Without loss of generality, let $s_0 = \{1, \ldots, p_0\}$. Assume that

(i) |/3i| > I/32I > • • • > |/3pol = Cn^^''^^ for some constant C and an arbitrarily small 
positive 5; 

(ii) The correlation matrix $\Sigma$ has the following structure:

$$\Sigma_{s_0 s_0} = I, \qquad \Sigma_{j s_0} = \frac{1}{p_0}\,\mathrm{sign}\,\beta(s_0)^{\tau}, \quad \text{for } j \in s_0^c.$$

In the following, we show that in this case the irrepresentable condition is violated
but conditions A1-A3 hold, and if in addition a2 and a3 are assumed, the sequential
LASSO is selection consistent. Obviously,

$$\Sigma_{j s_0}\Sigma_{s_0 s_0}^{-1}\,\mathrm{sign}\,\beta(s_0) = 1,$$

i.e., the irrepresentable condition does not hold. Let $s_{*0} = \emptyset$. Suppose
$s_{*k} = \{1, \ldots, k\}$ for $k < p_0$. For any $j \in s_0^c$,

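$$|\Gamma(j, s_{*k}, \beta)| = \frac{1}{p_0}\sum_{i=k+1}^{p_0}|\beta_i| < |\beta_{k+1}| = \max_{i \in s_{*k}^-}|\Gamma(i, s_{*k}, \beta)|.$$

Thus A1 is satisfied. The validity of A2 is obvious since $A_{s_{*k}}$ contains only one element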

for each k < pQ. A3 reduces to j^— - minj-g^o \Pj\ — >■ 00 which holds obviously, al follows 
from (ii). Then, when a2 and a3 are also satisfied, the sequential LASSO is selection 
consistent. 

3.4 Sequential LASSO with EBIC as stopping rule 

In the previous subsections, we have shown that, if we know $|s_0|$ a priori and stop
the sequential LASSO when the number of selected features is $|s_0|$, then the set of
selected features will be exactly the set of causal features with probability
converging to 1. But in practice we need a workable stopping rule for the sequential
LASSO since $|s_0|$ is unknown. The extended BIC (EBIC) proposed in [3] serves as a
suitable stopping rule because of its desirable property of selection consistency.
The EBIC is defined as follows:

$$\mathrm{EBIC}_{\gamma}(s) = n \ln\left(\frac{\|y - H(s)y\|_2^2}{n}\right) + |s| \ln n + 2\gamma \ln\binom{p_n}{|s|}, \quad \gamma \ge 0.$$

The selection consistency of the EBIC for linear regression models is established under
different assumptions on $p_n$ and $|s_0|$ in [3] and [H]. The following theorem (with
slight changes) is quoted from [TT]:

Theorem 3.3. Assume model (2.1) and the condition:



lim min{ : Son ^ s, \s\ < /c„} = oo, 

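where $k_n = k p_{0n}$ for any fixed $k > 1$. In addition, assume that
$p_n = O(\exp(n^{\kappa}))$ for $0 < \kappa < 1$, $p_{0n} = O(n^{c})$,
$\min\{|\beta_{nj}| : j \in s_{0n}\} = O(n^{-(1-b)/2})$, $0 < c, \kappa < 1$,
$c + \kappa < b < 1$. Then, if $\gamma > 1 - \frac{\ln n}{2\ln p_n}$,

$$P\Big\{\min_{s : |s| \le k_n} \mathrm{EBIC}_{\gamma}(s) > \mathrm{EBIC}_{\gamma}(s_{0n})\Big\} \to 1.$$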

Strictly speaking, the EBIC is to be used as a selection rule rather than a stopping
rule in the procedure of sequential LASSO. The procedure described in Section 2 needs
to be slightly modified. In the modified procedure, instead of stopping the sequential
LASSO when some stopping criterion is met, the procedure is carried out for a specified
$K$ steps, where $K$ is of order $O(n)$. Then $\mathrm{EBIC}_{\gamma}(s_{*k})$,
$1 \le k \le K$, are computed and compared. The $s_{*k}$ that minimizes the EBIC is then
selected. It is easy to see that, if the conditions for Theorem 3.3 and Theorem 3.1 (in
the case of fixed feature matrix) or Theorem 3.2 (in the case of random feature matrix)
hold, the selected set of features will be exactly the set of causal features with
probability converging to 1.

In actual implementation of the procedure, $K$ can be chosen as $rn$ for some
$0 < r < 1$. If, for a given $r$, the minimum EBIC is attained at or near $K$, then
raise $r$ to a larger value. Eventually, the minimum EBIC will be attained at a $k$
which is much less than $K$.
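The selection rule described above can be sketched in a few lines. The code below is an illustration only: it assumes the EBIC takes the form $n\ln(\mathrm{RSS}/n) + |s|\ln n + 2\gamma\ln\binom{p_n}{|s|}$, that the residual sum of squares and the model size are available at each step of the path, and it introduces a hypothetical tuning parameter `margin` to decide what "near $K$" means. None of the function names come from the paper.

```python
import math


def ebic(rss, n, p, size, gamma):
    """EBIC for a model with `size` features out of `p`, given its RSS."""
    # Log binomial coefficient via lgamma, which avoids huge integers.
    log_binom = (math.lgamma(p + 1) - math.lgamma(size + 1)
                 - math.lgamma(p - size + 1))
    return n * math.log(rss / n) + size * math.log(n) + 2.0 * gamma * log_binom


def select_step(path, n, p, gamma, margin=0.1):
    """Pick the step with the smallest EBIC along a path of K steps.

    `path` is a list of (rss, size) pairs, one per step.
    Returns (k_best, near_end): the 1-based index of the minimizing step and a
    flag that is True when the minimum falls in the last `margin` fraction of
    the path, which is the signal to enlarge K and rerun.
    """
    K = len(path)
    scores = [ebic(rss, n, p, size, gamma) for rss, size in path]
    k_best = min(range(1, K + 1), key=lambda k: scores[k - 1])
    near_end = k_best > (1.0 - margin) * K
    return k_best, near_end
```

When `near_end` is true, the path was probably too short to contain the minimum, so $r$ (and hence $K = rn$) should be raised before selecting.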

Another issue is the choice of $\gamma$ in EBIC. The selection consistency is an
asymptotic property. With a finite sample, the selected set need not be exactly the same
as the set of causal features. There may be causal features that are not in the selected
set, and there may also be non-causal features that are selected. The selection accuracy
is characterized by two measures, positive discovery rate ($\mathrm{PDR}_n$) and false
discovery rate ($\mathrm{FDR}_n$), which are defined below. Let $s^*$ be the selected set
of features. Then

$$\mathrm{PDR}_n = \frac{|s^* \cap s_{0n}|}{|s_{0n}|}, \qquad \mathrm{FDR}_n = \frac{|s^* \setminus s_{0n}|}{|s^*|}.$$

The selection consistency is equivalent to $\mathrm{FDR}_n \to 0$ and
$\mathrm{PDR}_n \to 1$. Although the EBIC is selection consistent as long as
$\gamma > 1 - \frac{\ln n}{2\ln p_n}$, the convergence rates of $\mathrm{PDR}_n$ and
$\mathrm{FDR}_n$ are different for different $\gamma$ values. For bigger $\gamma$, the
$\mathrm{FDR}_n$ is smaller but the $\mathrm{PDR}_n$ is also smaller. A reasonable
strategy is to maximize $\mathrm{PDR}_n$ while the selection consistency is still
retained. Thus a reasonable choice is $\gamma = 1 - \frac{\ln n}{2r\ln p_n}$ for some $r$
slightly larger than 1. In our numerical studies, we take $r = 1.5$.
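The two accuracy measures and the choice of $\gamma$ are simple to compute. The sketch below assumes PDR is the fraction of causal features that are selected and FDR is the fraction of selected features that are non-causal, and it reads the choice of $\gamma$ as $1 - \ln n/(2r\ln p_n)$; the function names are hypothetical.

```python
import math


def pdr_fdr(selected, causal):
    """Return (PDR, FDR) for a selected feature set against the causal set."""
    selected, causal = set(selected), set(causal)
    pdr = len(selected & causal) / len(causal)
    # Convention assumed here: an empty selection has FDR 0.
    fdr = len(selected - causal) / len(selected) if selected else 0.0
    return pdr, fdr


def gamma_choice(n, p, r=1.5):
    """gamma = 1 - ln(n) / (2 r ln(p)); r = 1.5 as in the numerical studies."""
    return 1.0 - math.log(n) / (2.0 * r * math.log(p))
```

With $r = 1.5$ the expression reduces to $1 - \ln n/(3\ln p_n)$, which lies strictly above the consistency threshold obtained at $r = 1$.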
4 Numerical Study

We report in this section our simulation study comparing the sequential LASSO with
FSR and ordinary LASSO. The comparison is made in two different ways. In the first
way, the sequential LASSO and FSR are stopped at step $p_{0n}$, and the solution path
of LASSO is computed until $p_{0n}$ features have non-zero coefficients. In the second
way, $p_{0n}$ is replaced by 50, which corresponds to $0.5n$ for $n = 100$, $0.25n$ for
$n = 200$ and $0.1n$ for $n = 500$, and the EBIC with $\gamma = 1 - \ln n/(3\ln p_n)$
is used to select the final set as described in Section 3.4.

The diverging patterns of $p_n$ and $p_{0n}$ are taken in accordance with the theorems
on the sequential LASSO and EBIC as $(p_{0n}, p_n) = ([4n^{0.16}], [5\exp(n^{0.3})])$.
For $n = 100$, 200 and 500, this yields the following table:

n       100    200    500
p_n     268    672    3,170
p_0n    8      9      11
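The tabulated dimensions can be reproduced directly from the two formulas. The sketch below assumes that the bracket $[\cdot]$ denotes rounding to the nearest integer, which is the reading that matches all six table entries; the function name is hypothetical.

```python
import math


def dimensions(n):
    """Return (p_0n, p_n) for sample size n, reading [.] as rounding."""
    p0n = round(4 * n ** 0.16)
    pn = round(5 * math.exp(n ** 0.3))
    return p0n, pn


table = {n: dimensions(n) for n in (100, 200, 500)}
```

Reading the bracket as the floor function instead would give $p_{0n} = 10$ at $n = 500$, which does not match the table.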



Two types of coefficients for causal features are considered. For the first type, the
coefficients are generated as independent random variables distributed as
$(-1)^{u}(4n^{-0.15} + |z|)$, where $u \sim \mathrm{Bernoulli}(0.4)$ and $z$ is a normal
random variable with mean 0 satisfying $P(|z| \ge 0.1) = 0.25$. The coefficients take
both positive and negative values and are roughly of order $O(n^{-0.15})$. For the second
type, the coefficients are generated as $2j^{0.5}n^{-0.15}$, $1 \le j \le p_{0n}$. The
coefficients are all positive and the minimum magnitude has order $O(n^{-0.15})$ while
the maximum magnitude has order $O(n^{-0.07})$.
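The two coefficient schemes can be sketched as follows. This is an illustration under stated assumptions: the first type is read as $(-1)^u(4n^{-0.15} + |z|)$ with $u \sim \mathrm{Bernoulli}(0.4)$, the second as $2j^{0.5}n^{-0.15}$, and the standard deviation of $z$ is derived from $P(|z| \ge 0.1) = 0.25$. The function names and the use of Python's `random` module are choices made here, not the authors'.

```python
import random
from statistics import NormalDist


def type1_coefficients(n, p0n, rng):
    """First type: random signs, magnitude 4 n^(-0.15) + |z|."""
    # P(|z| >= 0.1) = 0.25 means P(z <= 0.1) = 0.875, so sd = 0.1 / q_0.875.
    sd = 0.1 / NormalDist().inv_cdf(0.875)
    coefs = []
    for _ in range(p0n):
        u = 1 if rng.random() < 0.4 else 0  # u ~ Bernoulli(0.4)
        z = rng.gauss(0.0, sd)
        coefs.append((-1) ** u * (4 * n ** -0.15 + abs(z)))
    return coefs


def type2_coefficients(n, p0n):
    """Second type: deterministic, increasing in j = 1, ..., p0n."""
    return [2 * j ** 0.5 * n ** -0.15 for j in range(1, p0n + 1)]
```

With $p_{0n}$ of order $n^{0.16}$, the largest second-type coefficient is of order $n^{0.08 - 0.15} = n^{-0.07}$, which agrees with the stated maximum magnitude.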

The error variance $\sigma^2$ is determined by setting the following ratio to certain values:

$$h = \frac{\beta^{\tau}\Sigma\beta}{\beta^{\tau}\Sigma\beta + \sigma^{2}},$$



where $\beta$ is the true parameter vector and $\Sigma$ is the covariance matrix of the
predictors. Two sets of simulation studies with different correlation structures of the
features are considered. In the first set, the correlation structures for causal and
non-causal features are not distinguished. In the second set, different correlation
structures are assumed for causal and non-causal features. The two sets of simulation
studies are referred to as simulation study A and simulation study B. The correlation
structures for each study are described in the following.

Simulation Study A 

• Structure A1: All the $p_n$ features are statistically independent with mean zero
and variance 1.

• Structure A2: The $p_n$ features have a constant pairwise correlation, i.e.,
$\Sigma = (1-\rho)I + \rho\mathbf{1}\mathbf{1}^{\tau}$, where $I$ is a $p_n$-dimensional
identity matrix and $\mathbf{1}$ is a $p_n$-dimensional vector of elements 1.

• Structure A3: The $\Sigma$ satisfies $\Sigma_{ij} = \rho^{|i-j|}$ for all
$i, j = 1, 2, \ldots, p_n$. The true features are scattered in clusters of size 3 or 2.
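Features with the constant and the exponentially decaying correlation structures can be simulated without forming the covariance matrix. The sketch below assumes Gaussian features (the distribution is not stated in this section) and uses a shared-factor construction for the constant-correlation case and an autoregressive recursion for the decaying case; both are standard constructions swapped in for illustration, and the function names are hypothetical.

```python
import math
import random


def equicorrelated_row(p, rho, rng):
    """One observation of p unit-variance features with pairwise correlation rho."""
    w = rng.gauss(0.0, 1.0)  # shared factor inducing the common correlation
    return [math.sqrt(rho) * w + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)]


def ar1_row(p, rho, rng):
    """One observation of p unit-variance features with corr(i, j) = rho^|i-j|."""
    row = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        # x_j = rho * x_{j-1} + sqrt(1 - rho^2) * e_j keeps the variance at 1.
        row.append(rho * row[-1]
                   + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return row
```

Both generators cost $O(p_n)$ per observation, which matters when $p_n$ is in the thousands.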



Simulation Study B 

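• Structure B1: Let $Z_1, \ldots, Z_{p_n}$ and $W_1, \ldots, W_{p_{0n}}$ be i.i.d. random
vectors with distribution $N(0, I)$. The feature vectors are generated as:

$$X_j = \frac{Z_j + W_j}{\sqrt{2}}, \ \text{for } j \in s_{0n}; \qquad
X_j = \frac{Z_j + \sum_{k \in s_{0n}} Z_k}{\sqrt{1 + p_{0n}}}, \ \text{for } j \notin s_{0n}.$$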

• Structure B2: The features in $s_{0n}$ have constant pairwise correlation. Let
$X_j$, $j \in s_{0n}$, be the causal feature vectors generated accordingly. For
$j \notin s_{0n}$, the feature vectors are generated as:

$$X_j = \epsilon_j + \frac{\sum_{k \in s_{0n}} X_k}{p_{0n}},$$

where the $\epsilon_j$'s are independent vectors from $N(0, 0.08\,I_n)$. Here the
variance of $\epsilon_j$ is set to 0.08 in order for the second term, which is
correlated with the causal features, to dominate the variance.

• Structure B3: The features are generated in the same way as in Structure B2
except that the causal features are generated according to the covariance matrix
$\Sigma$ with $\Sigma_{ij} = \rho^{|i-j|}$ and $s_{0n}$ set to $\{1, 2, \ldots, p_{0n}\}$.
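The construction in Structure B2 can be sketched as follows. The code assumes Gaussian causal features with constant pairwise correlation, places the causal features first in each row for convenience, and reads the non-causal features as independent noise of variance 0.08 plus the average of the causal features. The ordering, the function name and the parameter names are assumptions made for this illustration.

```python
import math
import random


def structure_b2_row(p, p0, rho, rng, noise_var=0.08):
    """One observation: p0 causal features followed by p - p0 non-causal ones."""
    w = rng.gauss(0.0, 1.0)  # shared factor for the causal block
    causal = [math.sqrt(rho) * w + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
              for _ in range(p0)]
    mean_causal = sum(causal) / p0
    # Each non-causal feature is small independent noise plus the causal average.
    noncausal = [rng.gauss(0.0, math.sqrt(noise_var)) + mean_causal
                 for _ in range(p - p0)]
    return causal + noncausal
```

Because the noise variance is small, every non-causal feature tracks the average of the causal features closely, so the non-causal features are strongly correlated with the causal block and with one another.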

For each setting of the simulation studies, the PDR and FDR are averaged over
200 replicates. For simulation study A, we only report the results with the first type of
coefficients, since the results with the second type of coefficients are similar. The
results of simulation study A are given in Table 4.1. For simulation study B, the results
with different types of coefficients are quite different. The results of simulation
study B are given in Tables 4.2 and 4.3 respectively for the first and second type of
coefficients. For the structures involving the parameter $\rho$, we considered two values
of $\rho$: 0.3 and 0.5. The relative performance pattern among the three methods is the
same for different $\rho$ values. Only the results with $\rho = 0.5$ are reported in the
tables for the sake of clarity. The findings of the simulation studies are summarized and
discussed below.

Simulation study A is discussed first. When the features are all independent (A1),
SLasso and FSR are slightly better than Lasso in terms of both PDR and FDR, but Lasso
is still comparable with SLasso and FSR. When the features have constant pairwise
correlation (A2), SLasso and FSR are prominently better than Lasso, especially when
EBIC is used for feature selection, and SLasso and FSR are comparable. When the
correlations have an exponential decay (A3), again, the three methods are comparable,
with SLasso and FSR slightly better than Lasso. The simulation results are quite close
to each other when the known $p_{0n}$ is used as the stopping rule and when EBIC is used
for selection. Simulation study A demonstrates that, under the assumed covariance
structures, both SLasso and FSR perform better than Lasso. However, this study does not
distinguish between SLasso and FSR.

We now turn to simulation study B. The following are the findings in the case of the
first type of coefficients. When $p_{0n}$ is used as the stopping rule, SLasso is better
than Lasso, which in turn is better than FSR, in terms of higher PDR and lower FDR, and
the differences are quite significant. The same pattern prevails under all three
correlation structures. When EBIC is used for selection, both SLasso and Lasso are still
much better than FSR, but the relative performances of SLasso and Lasso are reversed.
However, SLasso is not much worse than Lasso.

In the case of the second type of coefficients in simulation study B, Lasso seems to be
totally off target no matter whether $p_{0n}$ is used as the stopping rule or EBIC is
used for selection. SLasso is much better than FSR under all three correlation
structures when $p_{0n}$ is used as the stopping rule. When EBIC is used for selection,
SLasso is still better than FSR under correlation structures B1 and B3. Under
correlation structure B2, FSR has higher PDR but also higher FDR, so it is hard to say
which one is better.

In general, SLasso manifests itself as the best procedure in simulation study B. This 
is not a coincidence. It is due to the intrinsic property of SLasso which Lasso and FSR 
lack, see ^ Under the correlation structures in simulation study B, the non-causal 
features are highly correlated collectively with all the causal features and also with each 
other among themselves. When causal features and/or some of non-causal features are 
selected, the SLasso has the capacity to resist selecting other non-causal features. 

To conclude, both simulation studies A and B provide evidence for the advantage
of using SLasso: in most cases, SLasso is the best procedure; in cases where SLasso
is not the best, it is comparable with the best. In a sense, it is robust over different
correlation structures of the features.
Table 4.1: Results of Simulation Study A with First Type of Coefficients

Sequential procedure stopped at step $p_{0n}$

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  A1       .940(.084)  .990(.035)  .989(.035)  .060(.084)  .010(.035)  .011(.035)
      A2       .642(.170)  .675(.183)  .659(.195)  .358(.171)  .325(.183)  .341(.195)
      A3       .778(.132)  .815(.151)  .782(.169)  .222(.132)  .185(.151)  .218(.169)
 200  A1       .999(.011)  1.00(.000)  1.00(.000)  .001(.011)  .000(.000)  .000(.000)
      A2       .623(.107)  .638(.129)  .638(.124)  .375(.109)  .362(.129)  .362(.124)
      A3       .838(.098)  .851(.112)  .840(.125)  .162(.098)  .149(.112)  .160(.125)
 500  A1       1.00(.000)  1.00(.000)  1.00(.000)  .000(.000)  .000(.000)  .000(.000)
      A2       .732(.142)  .736(.144)  .736(.144)  .268(.142)  .264(.144)  .264(.144)
      A3       .893(.089)  .893(.098)  .891(.103)  .107(.089)  .107(.098)  .109(.103)

Final set selected by EBIC

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  A1       .989(.053)  .992(.031)  .992(.031)  .071(.099)  .049(.080)  .051(.083)
      A2       .472(.282)  .503(.283)  .496(.281)  .108(.189)  .076(.169)  .088(.184)
      A3       .768(.198)  .799(.174)  .783(.188)  .104(.133)  .101(.128)  .118(.138)
 200  A1       1.00(.000)  1.00(.000)  1.00(.000)  .011(.033)  .029(.053)  .029(.053)
      A2       .403(.221)  .417(.232)  .417(.229)  .044(.108)  .011(.047)  .014(.050)
      A3       .839(.122)  .839(.128)  .836(.132)  .070(.113)  .058(.087)  .064(.093)
 500  A1       1.00(0.00)  1.00(.000)  1.00(.000)  .013(.036)  .017(.038)  .017(.038)
      A2       .679(.205)  .684(.202)  .682(.203)  .083(.128)  .112(.137)  .111(.137)
      A3       .894(.097)  .888(.106)  .888(.106)  .029(.060)  .026(.056)  .027(.056)






Table 4.2: Results of Simulation Study B with First Type of Coefficients

Sequential procedure stopped at step $p_{0n}$

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  B1       .682(.368)  .477(.276)  .728(.260)  .430(.325)  .523(.276)  .456(.213)
      B2       .778(.301)  .628(.366)  .924(.126)  .268(.278)  .372(.366)  .228(.155)
      B3       .783(.302)  .645(.336)  .902(.164)  .254(.282)  .355(.336)  .227(.181)
 200  B1       .665(.343)  .508(.285)  .729(.244)  .497(.280)  .492(.285)  .506(.187)
      B2       .727(.314)  .569(.366)  .939(.104)  .305(.303)  .431(.366)  .248(.148)
      B3       .788(.303)  .689(.360)  .909(.155)  .245(.290)  .311(.360)  .268(.185)
 500  B1       .686(.358)  .478(.263)  .756(.249)  .552(.264)  .522(.263)  .590(.158)
      B2       .694(.381)  .644(.376)  .963(.103)  .421(.322)  .356(.376)  .321(.156)
      B3       .766(.339)  .726(.362)  .917(.163)  .285(.316)  .273(.362)  .230(.205)

Final set selected by EBIC

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  B1       .685(.372)  .532(.262)  .751(.319)  .328(.322)  .816(.179)  .727(.232)
      B2       .853(.349)  .656(.345)  .647(.393)  .188(.340)  .789(.195)  .291(.246)
      B3       .859(.259)  .661(.327)  .728(.378)  .198(.266)  .766(.218)  .269(.253)
 200  B1       .656(.344)  .505(.288)  .623(.334)  .356(.319)  .488(.230)  .390(.275)
      B2       .908(.287)  .563(.372)  .604(.377)  .121(.285)  .367(.254)  .262(.208)
      B3       .874(.244)  .686(.365)  .768(.349)  .178(.270)  .307(.258)  .216(.228)
 500  B1       .680(.362)  .476(.265)  .643(.350)  .330(.356)  .535(.217)  .365(.322)
      B2       .949(.214)  .642(.378)  .766(.359)  .075(.219)  .312(.285)  .196(.223)
      B3       .912(.215)  .724(.365)  .817(.310)  .151(.279)  .270(.290)  .183(.230)






Table 4.3: Results of Simulation Study B with Second Type of Coefficients

Sequential procedure stopped at step $p_{0n}$

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  B1       .004(.029)  .366(.115)  .716(.123)  .997(.021)  .634(.115)  .440(.145)
      B2       .153(.141)  .133(.100)  .744(.127)  .856(.137)  .867(.100)  .380(.129)
      B3       .163(.136)  .295(.087)  .725(.123)  .852(.125)  .705(.087)  .391(.131)
 200  B1       .000(.000)  .432(.093)  .776(.103)  1.00(.000)  .568(.093)  .423(.135)
      B2       .064(.110)  .130(.098)  .793(.125)  .948(.090)  .870(.098)  .368(.133)
      B3       .079(.107)  .360(.077)  .740(.120)  .937(.085)  .640(.077)  .425(.151)
 500  B1       .000(.000)  .485(.072)  .870(.102)  1.00(.000)  .516(.072)  .376(.157)
      B2       .002(.014)  .176(.093)  .826(.116)  .999(.010)  .825(.093)  .399(.146)
      B3       .010(.038)  .447(.068)  .836(.111)  .994(.023)  .553(.068)  .398(.172)

Final set selected by EBIC

                             PDR                                   FDR
  n   Struc.   Lasso       FSR         SLasso      Lasso       FSR         SLasso
 100  B1       .002(.035)  .449(.168)  .619(.319)  .998(.024)  .826(.151)  .712(.226)
      B2       .000(.000)  .116(.161)  .018(.045)  1.00(.000)  .955(.110)  .944(.148)
      B3       .004(.033)  .305(.159)  .197(.150)  .994(.047)  .757(.225)  .528(.170)
 200  B1       .000(.000)  .390(.113)  .533(.199)  1.00(.000)  .565(.090)  .422(.126)
      B2       .000(.000)  .021(.054)  .026(.058)  1.00(.000)  .925(.178)  .925(.163)
      B3       .004(.039)  .277(.074)  .273(.092)  .997(.033)  .460(.100)  .428(.130)
 500  B1       .000(.000)  .530(.083)  .758(.198)  1.00(.000)  .529(.062)  .363(.136)
      B2       .000(.000)  .050(.063)  .049(.065)  1.00(.000)  .844(.187)  .863(.182)
      B3       .003(.042)  .381(.066)  .398(.097)  .999(.018)  .434(.062)  .379(.126)